Method for image region classification using unsupervised and supervised learning

ABSTRACT

A method for classification of image regions by probabilistic merging of a class probability map and a cluster probability map includes the steps of a) extracting one or more features from an input image composed of image pixels; b) performing unsupervised learning based on the extracted features to obtain a cluster probability map of the image pixels; c) performing supervised learning based on the extracted features to obtain a class probability map of the image pixels; and d) combining the cluster probability map from unsupervised learning and the class probability map from supervised learning to generate a modified class probability map to determine the semantic class of the image regions. In one embodiment the extracted features include color and texture features.

FIELD OF THE INVENTION

[0001] The invention relates generally to the processing and classification of images, and in particular to the semantic classification of image regions.

BACKGROUND OF THE INVENTION

[0002] With the increasing use of digital imaging in general consumer applications, efficient management and organization of the image data has become important. The number of images even in a personal collection of a typical consumer can be fairly large. The concept of Auto-Albuming is a key step towards reducing the cost, time and effort in organizing such large image databases. In particular, a semi-automatic event-clustering scheme may be used to sort a set of pictures into different groups, where each group contains similar pictures. Such schemes work on spatial color distribution and use process-intensive merging algorithms to group similar images together. However, such algorithms do not tell anything about the semantic class (e.g., beach, birthday party, swimming pool, etc.) to which each of these groups belongs. Thus, the next step needed for the automation of the albuming process is to find the semantic classification of the event contained in a group of ‘similar’ pictures.

[0003] After going through a large database of consumer pictures, it was observed that a majority of the groups of ‘similar’ images comes from the following classes: baby pictures, wedding party, birthday party, convocation, picnic, landscape, city pictures, beach, swimming pool and ocean view. Of course, the list is neither exhaustive nor mutually exclusive, i.e., there ordinarily are several pictures which may be classified under more than one event. In addition, the classification, in some cases, may be subjective. Hence, the task of event classification for very generic scenes is a very difficult problem. It requires not only the knowledge of the image regions or objects but also the semantic information contained in their emotional and spatial arrangement. Considering the state of the art research in computer vision, solving this problem is an enormous task. However, classification of the images belonging to most of the natural outdoor scenes, e.g., landscape, beach, swimming pool, garden, ocean view, etc. is mostly based on a few ‘natural’ regions in the image. Examples of these natural regions include water, sky, grass, sand, skin, etc. Although these regions show wide variations in their appearances, they can be captured to a certain extent by using simple features such as color, texture, shape, etc. While the present work deals only with the natural scene images, the proposed scheme can be modified to incorporate more high-level features, e.g., face location, etc., to widen its scope.

[0004] The main motivation for the present invention comes from the paradox of scene (or event) classification. In the absence of any a priori information, the scene classification task requires the knowledge of regions and objects contained in the image. On the other hand, it is increasingly being recognized in the vision community that context information is necessary for reliable extraction of knowledge of the image regions and objects.

[0005] It would be useful to be able to represent the semantic classification of each pixel in an image. A deterministic approach would entail the reclassification of image regions from the beginning, and it is not very clear how one would be able to encode the context information efficiently in a deterministic framework. However, instead of employing a deterministic model and, e.g., assigning each pixel to one of the classes in a recognition vocabulary, a probabilistic framework would seem to offer more promise. What is needed is a technique that would effectively generate a class probability over the input image, which would represent the probability of each pixel having come from a given class.

SUMMARY OF THE INVENTION

[0006] The present invention is directed to overcoming one or more of the problems set forth above. Briefly summarized, according to one aspect of the present invention, the invention resides in a method for classification of image regions by probabilistic merging of a class probability map and a cluster probability map. The method includes the steps of a) extracting one or more features from an input image composed of image pixels; b) performing unsupervised learning based on the extracted features to obtain a cluster probability map of the image pixels; c) performing supervised learning based on the extracted features to obtain a class probability map of the image pixels; and d) combining the cluster probability map from unsupervised learning and the class probability map from supervised learning to generate a modified class probability map to determine the semantic class of the image regions. In one embodiment the extracted features include color and texture features.

[0007] The invention employs an iterative feedback scheme. At first, image regions are classified in different semantic natural classes. Then, multiple hypotheses are generated about the scene using the classified regions. The scene hypotheses, in turn, allow generation of context information. The scene contexts are used to further refine the classification of image regions. Under this paradigm, an obvious choice of region classification scheme is one that allows easy modification of the initial classification without classifying the regions afresh. Probabilistic classification of regions can provide great flexibility in future refinement using a Bayesian approach, as the context information can be encoded as improved priors.

[0008] These and other aspects, objects, features and advantages of the present invention will be more clearly understood and appreciated from a review of the following detailed description of the preferred embodiments and appended claims, and by reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 is a block diagram of a technique for probabilistic classification of image regions according to the invention.

[0010] FIG. 2 shows an input image.

[0011] FIG. 3 shows the input image from FIG. 2 in g-RGB space.

[0012] FIG. 4 shows the texture strength for the input image from FIG. 2.

[0013] FIG. 5 shows the color histogram in g-RGB space for the input image from FIG. 2.

[0014] FIG. 6 shows the change in KL Divergence as the number of components is increased.

[0015] FIGS. 7(a)-(e) show cluster probability maps for different clusters of the input image shown in FIG. 2.

[0016] FIG. 8 shows an example of a mixture of Gaussians for a “sky” class.

[0017] FIG. 9 shows a class probability map for the input image shown in FIG. 2.

[0018] FIGS. 10(a)-(e) show the posterior probability for the five cluster probability maps shown in FIGS. 7(a)-(e).

[0019] FIG. 11 shows a modified class probability map for the class probability map shown in FIG. 9.

[0020] FIG. 12 is a perspective diagram of a computer system for implementing the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0021] Because image processing systems employing classification of images are well known, the present description will be directed in particular to attributes forming part of, or cooperating more directly with, the method in accordance with the present invention. Attributes not specifically shown or described herein may be selected from those known in the art. In the following description, a preferred embodiment of the present invention would ordinarily be implemented as a software program, although those skilled in the art will readily recognize that the equivalent of such software may also be constructed in hardware. Given the method as described according to the invention in the following materials, software not specifically shown, suggested or described herein that is useful for implementation of the invention is conventional and within the ordinary skill in such arts.

[0022] If the invention is implemented as a computer program, the program may be stored in a conventional computer readable storage medium, which may comprise, for example: magnetic storage media such as a magnetic disk (such as a floppy disk or a hard drive) or magnetic tape; optical storage media such as an optical disc, optical tape, or machine readable bar code; solid state electronic storage devices such as random access memory (RAM), or read only memory (ROM); or any other physical device or medium employed to store a computer program.

[0023] The preferred technique for probabilistic classification of image regions is shown in FIG. 1. The main aim of this technique is to find a class probability map over the input image representing the probability of each pixel to have come from a given class. As the first step, several features are extracted in a feature extraction stage 12 from an input color image 10. This step is key to the future processing of the image 10 since the image is now represented by these features. The nature of these features may vary according to their interpretational power from low level feature information such as color, texture, shapes, wavelet coefficients, etc. to higher, semantic-level feature information such as location of faces, people, structure, etc. The type of features to be extracted from an image depends on the nature of the scene classification task. For instance, if the scene is to be classified as a wedding party, there would be more interest in high level information such as the location of faces, people, the color distribution of dresses worn by the people, etc. As stated earlier, low-level features reveal good representational power for the region classification of natural scenes, which are the scenes of interest for the current embodiment of the invention. The most common feature selection techniques are based either on maximization of mutual information or on some sort of statistical test of dependence between the classes and the features. Features that show high mutual information or dependence are chosen as good features.

[0024] Once the features have been extracted and selected, the next step is to use stages 14 and 16 of unsupervised and supervised learning, respectively, to obtain cluster and class probability maps 26 and 28, respectively. The main reason for using both learning techniques is that unsupervised learning 14 (which herein includes, without limitation, clustering) selects how many clusters there are in the image (in a component selection stage 18) and employs a clustering algorithm 20 to cluster the similar pixels in distinct groups, but does not account for semantic similarity between the pixels while clustering. Pixels belonging to different classes may get clustered in the same group depending on the composition of pixel data in the feature space. On the other hand, the supervised learning 16 employs a generative model 22 to predict the semantic similarity of each pixel with the class data, but does not enforce the regional similarity between pixels to obtain a better hypothesis. In addition, the supervised learning 16 is based on labeled class training data 24 that may include wide variations in pixel appearances due to different physical conditions, e.g., sharp illumination changes. This usually leads to several false positives in classification. Neither of the learning schemes is perfect. But this observation reveals the potential of merging the two learning paradigms to obtain a better, modified class probability map(s) 30. This can be done by enforcing the pixel similarity in the input image considered by clustering on the probabilistic classification results obtained by the supervised learning. Both the learning techniques and their merger will be discussed in detail in later sections.

[0025] Color Features

[0026] Color is an important component of the natural scene classes. Color based features are particularly desirable for reliable classification of image regions. To extract the color features, the color must be represented in a suitable space. Numerous color spaces which encode color information effectively have been reported in the literature. For the preferred embodiment, five conventional color spaces were tested: HSV, rectangular HSV, generalized RGB (g-RGB or rg), RGB ratio (r-RGB) and LST (an equivalent of Luv). A suitable color space must:

[0027] Provide enough similarity between the pixels belonging to the same class.

[0028] Provide enough discrimination between the pixels belonging to different classes.

[0029] Be able to factor out the variations of illuminant brightness effectively.

[0030] HSV space has good power for separating chrominance and luminance channels so as to factor out the effects of luminance. However, the problem with this space is that a perceptually equal change in hue is not equivalent to an equal change in saturation. For a given color in HSV space (h, s, v), rectangular HSV coordinates are given by (sv cos(2πh), sv sin(2πh), v). This encoding utilizes the fact that hue differences are meaningless for very small saturations. However, this encoding scheme ignores the fact that for large values and saturations, hue differences are more perceptually relevant than saturation and value differences. Generalized RGB (g-RGB) and RGB ratio (r-RGB) spaces tend to normalize the effects of luminance. The g-RGB space is given by two coordinates, $\left( r = \frac{R}{R + G + B},\quad g = \frac{G}{R + G + B} \right)$

[0031] where (R, G, B) are the coordinates in the RGB space. It is clear that one degree of freedom is lost in this conversion since the third coordinate of this space is simply (1 − r − g). In the r-RGB space the coordinates are given by $\left( g = \frac{G}{R},\quad b = \frac{B}{R} \right).$

[0032] As in g-RGB space, here also one degree of freedom is lost. This loss of one degree of freedom is true for almost all color spaces that either normalize or dispense with the luminance information. The last space, LST color space, is a variant of CIE Luv color space. This space is reported to be close to human perception of color.

[0033] All the above spaces inherently assume that the luminance signal is independent of the chrominance signals. However, there exist a few experiments which support the evidence that for several colors, change in hue is related to the change in luminance. For further details, a thorough review of various color models has been presented in R. B. Norman, Electronic Color, Van Nostrand Reinhold Press, 1990. In the present embodiment, the results with the last four spaces, i.e., rectangular HSV, g-RGB, r-RGB and LST, were found to be fairly similar. g-RGB space is chosen as it shows good luminance normalization and all the coordinates vary between 0 and 1, making it attractive for further mathematical computations. An input image is given in FIG. 2. The corresponding image in g-RGB space is shown in FIG. 3.
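The g-RGB conversion described above may be illustrated by the following minimal sketch. It assumes the image is held as an H x W x 3 NumPy array; the function name rgb_to_grgb and the handling of black pixels (mapped to the neutral point) are assumptions made for illustration and are not part of the described embodiment.

```python
import numpy as np

def rgb_to_grgb(image, eps=1e-12):
    """Convert an H x W x 3 RGB array to g-RGB coordinates (r, g)."""
    rgb = np.asarray(image, dtype=np.float64)
    total = rgb.sum(axis=-1)                      # R + G + B per pixel
    valid = total > eps
    safe = np.where(valid, total, 1.0)            # avoid division by zero
    # Assumption of this sketch: black pixels have no defined chromaticity
    # and are mapped to the neutral point (1/3, 1/3).
    r = np.where(valid, rgb[..., 0] / safe, 1.0 / 3.0)
    g = np.where(valid, rgb[..., 1] / safe, 1.0 / 3.0)
    return np.stack([r, g], axis=-1)              # third coordinate is 1 - r - g
```

Because both coordinates are ratios, scaling all three channels of a pixel by the same brightness factor leaves (r, g) unchanged, which is the luminance normalization referred to above.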

[0034] Texture Features

[0035] In contrast to color, texture is not a point property. Instead, it is defined over a spatial neighborhood. Even though analysis of texture started decades ago, there is still no universally accepted definition of texture. According to a general notion of texture, however, texture can provide good discrimination between natural classes that resemble each other in the color domain, e.g., water and sky. However, texture should be handled with more care, as any single class does not have a unique texture. Within a semantically coherent region, there might be areas of high or low texture, with different scales and directional uniformity. Textural details can be used in a strict manner to discriminate between textural patterns as given in the Brodatz Album (P. Brodatz: Textures: A Photographic Album for Artists and Designers, Dover Publications, N.Y., 1966). But in the classification task, a very strong texture measure can sometimes undo the good work done by color features.

[0036] Several techniques have been reported in the literature to compute the texture in a pixel neighborhood. The well-known ones include the Multiresolution Simultaneous Autoregressive (MSAR) model, Gabor wavelets and the Second Moment Eigenstructure (SME). It has been shown by Manjunath et al. in “Texture Features for Browsing and Retrieval of Image Data”, IEEE Trans. Pattern Analysis Machine Intelligence, vol. 18, No. 8, 1996, which is incorporated herein by reference, that multiresolution Gabor wavelets are faster and provide slightly better results than the MSAR model. Also, Gabor wavelets can be computed at varying resolutions and orientations, which makes them a strong texture measure. There are two main problems while using Gabor wavelets: (a) use of a strong texture measure sometimes interferes with the functioning of the color features, and (b) Gabor wavelets are effective when used at multiple scales and orientations, which leads to an increase in the dimensions of the feature space. With this, the chances of the non-linear optimization techniques used in learning getting stuck in a local extremum also increase. A result related to unsupervised clustering with Gabor features will be shown later.

[0037] The Second Moment Eigenstructure (SME) yields a slightly weaker measure of texture than the other methods discussed above, but it captures the essential neighborhood characteristics of a pixel. (The SME is described in Carson C., Belongie S., Greenspan H., and Malik J., “Region-Based Image Querying”, CVPR '97, Workshop on Content-Based Access of Image and Video Libraries, 1997 and Sochen N., Kimmel R., and Malladi R., “A General Framework for Low Level Vision”, IEEE Trans. on Image Processing, 1999, which are incorporated herein by reference.) The second moment matrix is given as: $\begin{matrix}{{{Second}\quad {Moment}\quad {Matrix}} = \begin{bmatrix}{\sum\limits_{c}{\sum\limits_{W}\left( I_{x}^{c} \right)^{2}}} & {\sum\limits_{c}{\sum\limits_{W}{I_{x}^{c}I_{y}^{c}}}} \\{\sum\limits_{c}{\sum\limits_{W}{I_{x}^{c}I_{y}^{c}}}} & {\sum\limits_{c}{\sum\limits_{W}\left( I_{y}^{c} \right)^{2}}}\end{bmatrix}} & \left( {{Eq}.\quad 1} \right)\end{matrix}$

[0038] where $I_{k}^{c}$ is the gradient of the image in spatial direction k over the color channel c for k = x, y, and c = R, G, B. W is the window over which these gradients are summed. Gaussian weighting is used in window W around the pixel of interest to give more weight to pixels near it. Also, the spatial derivatives are computed using 1D derivatives of Gaussians in the x and y directions. The texture from SME is called ‘colorized texture’ because the above matrix captures the texture in color space instead of the usual gray level, or intensity, space. The second moment matrix can be shown to be a modification of a bilinear, symmetric positive definite metric defined over the 2D image manifold embedded in a 5D space of (R, G, B, x, y) (see the aforementioned article by Sochen et al.). The eigenstructure of the second moment matrix represents the textural properties. Two measures have been defined (see the aforementioned article by Carson et al.) using the eigenvalues of the matrix: (a) anisotropy = $1 - \lambda_{2}/\lambda_{1}$, and (b) normalized strength = $2\sqrt{\lambda_{1} + \lambda_{2}}$, where $\lambda_{1}$ and $\lambda_{2}$ are the two eigenvalues of the matrix given in equation (1) and $\lambda_{1} > \lambda_{2}$. In the present embodiment of the invention, the combination of anisotropy and the strength is used as the texture measure. It is called ‘texture strength’ (S), where S = anisotropy × normalized strength.
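The texture strength computation may be sketched as follows. This is a simplified illustration only: central differences (np.gradient) stand in for the derivative-of-Gaussian filters, and an unweighted box window stands in for the Gaussian-weighted window W. The function names box_sum and texture_strength and the radius parameter are hypothetical.

```python
import numpy as np

def box_sum(a, radius):
    """Sum of `a` over a (2*radius+1) x (2*radius+1) window (edges replicated)."""
    padded = np.pad(a, radius, mode="edge")
    out = np.zeros(a.shape, dtype=np.float64)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def texture_strength(image, radius=2, eps=1e-12):
    """Texture strength S = anisotropy * normalized strength, per pixel."""
    img = np.asarray(image, dtype=np.float64)
    sxx = np.zeros(img.shape[:2])
    sxy = np.zeros(img.shape[:2])
    syy = np.zeros(img.shape[:2])
    for c in range(img.shape[2]):                 # sum over color channels
        iy, ix = np.gradient(img[..., c])         # axis 0 is y, axis 1 is x
        sxx += box_sum(ix * ix, radius)
        sxy += box_sum(ix * iy, radius)
        syy += box_sum(iy * iy, radius)
    # Closed-form eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    half_trace = 0.5 * (sxx + syy)
    root = np.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    lam1 = half_trace + root
    lam2 = np.maximum(half_trace - root, 0.0)
    anisotropy = np.where(lam1 > eps, 1.0 - lam2 / np.maximum(lam1, eps), 0.0)
    strength = 2.0 * np.sqrt(lam1 + lam2)
    return anisotropy * strength
```

A uniform region yields S = 0, while a straight edge yields an anisotropy close to 1, which is consistent with the observation that the measure gives more weight to edges.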

[0039] The texture strength S for the original color image (FIG. 2) is shown in FIG. 4. Higher brightness implies higher textural strength. Texture strength captures the textural variations in the image well except on the edges, as it tends to give more weight to edges.

[0040] Unsupervised Learning

[0041] Unsupervised learning refers to learning the similarities in the data without using any labeled training set. Presently, unsupervised learning has used a clustering procedure 20. Clustering has been used here to perform exploratory data analysis, thereby to gain an insight into the structure of the data belonging to the input image. This results in groups of patterns whose members are more similar to each other than they are to other patterns. Thus, unsupervised learning enforces the idea of similarity in the observed data without relying on any training set.

[0042] As is true of the well known clustering procedures, the data belonging to various clusters is assumed to follow a known probability distribution, and learning therefore amounts to estimating the parameters of this distribution. However, the present embodiment deviates slightly from the way clustering is interpreted in a traditional sense. Instead of assigning each image pixel to a particular cluster, the present embodiment assigns to each pixel a probability of association with a particular cluster. This is called ‘soft’ or probabilistic clustering. In this embodiment of the invention, unsupervised learning 14 was intended to refine the results obtained from the supervised learning 16 by enforcing the pixel similarities in a given image. The ‘soft’ paradigm can be readily incorporated in a Bayesian approach for this purpose. In fact, it can be shown that all the ‘hard’ or deterministic clustering schemes are special cases of ‘soft’ clustering where a decision is made to assign a pixel to a particular cluster on the basis of some hypothesis, e.g., Maximum A Posteriori (MAP). To understand the flaw in ‘hard’ clustering, consider the MAP scheme applied to the following example. According to MAP, a pixel is assigned to that cluster for which the posterior probability of the cluster given the pixel is highest. In a two-cluster case, for a certain pixel, if the posterior of the first cluster is 0.51 while for the other it is 0.49, MAP will assign the pixel to the first cluster even though intuitively it is really hard to predict concretely anything about the pixel cluster association. The difference of posteriors might be just due to some statistically insignificant or irrelevant factors.

[0043] The present embodiment assumes that the underlying clusters are normally distributed. A weak justification of this assumption comes from the central limit theorem and the strong law of large numbers. As per the above assumption, the image data follows a mixture-of-Gaussians distribution. According to this, the density of the observed data can be given by, $\begin{matrix}{{p(x)} = {\sum\limits_{j = 1}^{K}{{p\left( {x/j} \right)}{P(j)}}}} & \left( {{Eq}.\quad 2} \right)\end{matrix}$

[0044] where x is the pixel data and K is the number of clusters (or components). As per this model, each data point is generated by first choosing a cluster with probability P(j) and then generating the data point with probability p(x/j), which is a Gaussian (by assumption), i.e., $\begin{matrix}{{p\left( {x/j} \right)} = {\frac{1}{\left| {2\pi \quad C_{j}} \right|^{1/2}}\exp \left\{ {{- \frac{1}{2}}\left( {x - \mu_{j}} \right)^{T}{C_{j}^{- 1}\left( {x - \mu_{j}} \right)}} \right\}}} & \left( {{Eq}.\quad 3} \right)\end{matrix}$

[0045] where $\mu_{j}$ is the mean and $C_{j}$ is the covariance matrix of cluster j. There are two main issues to be addressed while learning the density given in equation (2), i.e.,

[0046] How many clusters are there in the given image (the component selection stage 18).

[0047] What are the density parameters: mixing parameters P(j), mean and covariance matrix of each Gaussian (parameter estimation).
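The mixture density of equations (2) and (3) may be evaluated as in the following minimal sketch. The function names gaussian_density and mixture_density are hypothetical, and the sketch assumes the pixel data is supplied as rows of a NumPy array.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Equation (3): Gaussian density p(x/j), evaluated at each row of x."""
    x = np.atleast_2d(np.asarray(x, dtype=np.float64))
    mean = np.asarray(mean, dtype=np.float64)
    cov = np.asarray(cov, dtype=np.float64)
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))      # |2 pi C_j|^(1/2)
    mahalanobis = np.einsum("ni,ij,nj->n", diff, inv, diff)
    return np.exp(-0.5 * mahalanobis) / norm

def mixture_density(x, priors, means, covs):
    """Equation (2): p(x) = sum over j of p(x/j) P(j)."""
    return sum(P * gaussian_density(x, m, C)
               for P, m, C in zip(priors, means, covs))
```

For a two-dimensional feature such as g-RGB, a unit-covariance component evaluated at its own mean returns 1/(2π), which provides a quick sanity check of the normalizing constant.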

[0048] Component Selection

[0049] In the absence of dogmatic priors, most of the parameter estimation techniques maximize the likelihood of the observed data. These are known as Maximum Likelihood (ML) techniques. However, component selection is an ideal example of a problem where the ML approach fails. The reason for the failure is that as the number of components increases in the mixture model, the likelihood of the data always increases. Thus, in the limit, we will have one component for each data point of the observed data.

[0050] Several approaches have been proposed under the topic of model selection to overcome this problem. A widely used technique is Minimum Description Length (MDL) proposed by Rissanen (Rissanen, J., “Modeling by Shortest Data Description”, Automatica, vol. 14, p. 465-471, 1978, which is incorporated herein by reference). In MDL, while maximizing the data likelihood, a penalty is imposed as the number of parameters increases. The description length (DL) is given as:

DL=−log p(X/θ)+(l/2) log n  (Eq. 4)

[0051] where X is the overall observed data, θ is the parameter vector containing all the model parameters, l is the number of parameters and n is the data size. The first term in the RHS of equation (4) is the negative of the log likelihood (i.e., code length of likelihood) and the second term is the code length of the parameters, which acts as a penalty term as the number of parameters increases. Each parameter requires a code length proportional to (1/2) log n, which is the precision with which they can be encoded (where estimation precision is measured by the estimation error standard deviation) (see Hansen, M. H., and Yu, B., “Model Selection and the Principle of Minimum Description Length”, JASA, Vol. 96, No. 454, pp. 746-774, which is incorporated herein by reference). The number of components in the mixture model is chosen by minimizing the DL. However, there are two limitations while implementing this criterion. First, MDL needs iterative schemes to compute the likelihood, which are prone to getting stuck in local extrema, and second, the iterative schemes are fairly slow and become almost impractical for online unsupervised learning. Thus, we use a Kullback-Leibler Divergence based method, which is described in Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, 1995, to estimate the number of components in the image.
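For illustration, the description length of equation (4) may be computed as in the sketch below. The parameter count assumes a mixture of K full-covariance Gaussians in d dimensions (K − 1 free mixing parameters, K·d mean entries and K·d(d+1)/2 covariance entries); this count and the function names are assumptions of the sketch rather than values stated in the description.

```python
import math

def gmm_parameter_count(k, d):
    """Free parameters of a k-component, d-dimensional full-covariance mixture.

    Assumption of this sketch: (k - 1) mixing parameters, k*d mean entries
    and k*d*(d+1)/2 distinct covariance entries.
    """
    return (k - 1) + k * d + k * d * (d + 1) // 2

def description_length(log_likelihood, num_params, n):
    """Equation (4): DL = -log p(X/theta) + (l/2) log n."""
    return -log_likelihood + 0.5 * num_params * math.log(n)

def select_by_mdl(log_likelihoods, d, n):
    """Return the k (starting at 1) whose fitted mixture minimizes the DL."""
    lengths = [description_length(ll, gmm_parameter_count(k, d), n)
               for k, ll in enumerate(log_likelihoods, start=1)]
    return lengths.index(min(lengths)) + 1
```

Here select_by_mdl expects one maximized log likelihood per candidate K, each of which would have to be obtained by an iterative fit; this is the cost that motivates the KLD-based alternative described above.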

[0052] Kullback-Leibler Divergence (KLD) is used for comparing two densities. $\begin{matrix}{{KLD} = {- {\int{{\overset{\sim}{p}(x)}\log \frac{p(x)}{\overset{\sim}{p}(x)}\, dx}}}} & \left( {{Eq}.\quad 5} \right)\end{matrix}$

[0053] where x is the data, $\overset{\sim}{p}(x)$ is the true density and p(x) is the model density. It can be shown that KLD ≥ 0, with equality if and only if the two density functions are equal. KLD is not a symmetric function (hence not a metric). This is reasonable since it is more important for the model distribution p(x) to be close to the true distribution $\overset{\sim}{p}(x)$ in regions where data is more likely to be found.
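A discrete counterpart of the KLD, suitable for comparing a histogram with a model evaluated on the same bins, may be sketched as follows. The function name kl_divergence, the normalization of both inputs to unit sum, and the small floor eps applied to the model are assumptions made for this illustration. The sign convention is chosen so that the result is non-negative, in agreement with the property KLD ≥ 0 stated above.

```python
import numpy as np

def kl_divergence(true_hist, model_hist, eps=1e-300):
    """Discrete KLD of a model distribution from the true distribution."""
    p_true = np.asarray(true_hist, dtype=np.float64).ravel()
    p_model = np.asarray(model_hist, dtype=np.float64).ravel()
    p_true = p_true / p_true.sum()                # normalize to unit sum
    p_model = p_model / p_model.sum()
    mask = p_true > 0                             # empty bins contribute nothing
    ratio = p_true[mask] / np.maximum(p_model[mask], eps)
    return float(np.sum(p_true[mask] * np.log(ratio)))
```

Swapping the two arguments generally changes the result, which reflects the asymmetry noted above.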

[0054] To apply the KLD in the component selection process, the model density is given by the mixture of Gaussians in equation (2) but the true density is not known. The color histogram is assumed to be representative of the true density. This assumption is reasonable if the image is composed of smooth and homogeneous regions, as is the case with natural scene images. Also, it is necessary to learn the model density without explicitly computing the parameters by maximizing the likelihood. This is done by incrementally fitting Gaussians on the modes of the color histogram. A Laplacian method can yield a good normal approximation of non-normal densities (see Tanner, M. A., Tools for Statistical Inference, 3rd Ed., Springer-Verlag, Inc. 1996). As per this method, parameters of a Gaussian are obtained by matching the curvature of the Gaussian with that of the original density at the mode. This is equivalent to using the negative of the inverse of the Hessian matrix at the mode of the original density as the covariance matrix of the Gaussian. While this technique may be implemented by computing the Hessian matrix using Singular Value Decomposition (SVD) on the local neighborhood of the mode of the histogram, most of the Hessians obtained using this approach were found to be almost singular because color histograms usually show sharp discontinuities at the modes. Thus, a simpler approach was implemented by assuming the covariance matrices to be scalar multiples of the identity matrix. The peak of the Gaussian was matched with the peak of the histogram at the mode. The scalar multiple was computed using the property that a proper density should integrate to 1. As per this, if the height of the histogram at a given mode is $h_{m}$, and the covariance matrix C is given as C = σI, where I is the identity matrix and σ is a constant (so that, in the two-dimensional g-RGB space, $\left| C \right|^{1/2} = \sigma$), then $\begin{matrix}{{h_{m} = \frac{1}{2\pi \left| C \right|^{1/2}}}\quad {or}\quad {\sigma = \frac{1}{2\pi \quad h_{m}}}} & \left( {{Eq}.\quad 6} \right)\end{matrix}$

[0055] As per the proposed method, the first estimate of the model density is obtained by fitting the first Gaussian at the highest mode of the histogram, and the KLD between the color histogram and the model density is computed. Next, a new density is computed by fitting one more Gaussian on the next highest mode of the histogram and mixing the two Gaussians as per equation (2). The mixing parameters have been assumed to be uniform. As per the above formulation, as the number of Gaussians is increased, the KL divergence decreases, and it will either stabilize or start increasing when the number of Gaussians grows beyond the natural regions represented in the image, depending on how smooth the color histogram is. For the color image shown in FIG. 2, the color histogram in g-RGB space is given in FIG. 5. FIG. 6 shows the changes in KL Divergence when the number of components (or Gaussians), K, is increased. As K is increased from 1 to 2, a sharp decrease in KLD can be noticed. After K=5, the KLD almost stabilizes, indicating the non-significance of further increments in K. Hence, we have chosen 5 as the number of clusters in the original image. Intuitively, one can observe five broad categories in the input image, i.e., sky, water, red wall, floor/skin and tree.
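The incremental fitting procedure may be sketched as follows for a two-dimensional histogram over the unit square (as with g-RGB). This is an illustrative simplification: modes are taken to be local maxima in a 3 x 3 bin neighborhood, the histogram height is converted to a density by dividing by the bin area, and the function names local_modes and kld_vs_components are hypothetical.

```python
import numpy as np

def local_modes(density):
    """Bin indices (row, col) of local maxima, ordered from highest to lowest."""
    h, w = density.shape
    padded = np.pad(density, 1, mode="constant", constant_values=-np.inf)
    is_max = density > 0
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue
            is_max &= density >= padded[dy:dy + h, dx:dx + w]
    idx = np.argwhere(is_max)
    return idx[np.argsort(-density[is_max])]

def kld_vs_components(hist, max_k):
    """KLD between the histogram and the model density for K = 1, 2, ..."""
    hist = np.asarray(hist, dtype=np.float64)
    rows, cols = hist.shape
    area = 1.0 / (rows * cols)                    # bin area on the unit square
    prob = hist / hist.sum()                      # discrete true distribution
    density = prob / area                         # histogram height as a density
    ys = (np.arange(rows) + 0.5) / rows           # bin centers
    xs = (np.arange(cols) + 0.5) / cols
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    mask = prob > 0
    components, klds = [], []
    for k, (r, c) in enumerate(local_modes(density)[:max_k], start=1):
        h_m = density[r, c]
        sigma = 1.0 / (2.0 * np.pi * h_m)         # Eq. 6, with C = sigma * I
        d2 = (gy - ys[r]) ** 2 + (gx - xs[c]) ** 2
        components.append(h_m * np.exp(-0.5 * d2 / sigma))
        model = sum(components) / k               # uniform mixing parameters
        q = model * area
        q = q / q.sum()
        ratio = prob[mask] / np.maximum(q[mask], 1e-300)
        klds.append(float(np.sum(prob[mask] * np.log(ratio))))
    return klds
```

The number of components would then be read off the returned list as the point at which the KLD stops decreasing appreciably, in the manner of FIG. 6; the stopping rule itself is left to the user of this sketch.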

[0056] Parameter Estimation

[0057] Once the number of clusters has been ascertained, the next step is to estimate the parameters of the model given in equations (2) and (3), i.e., mean $\mu_{j}$ and covariance matrix $C_{j}$ for cluster j, and the mixing parameters P(j) for j = 1, . . . , K. The traditional Maximum Likelihood (ML) approach is used to estimate these parameters. The maximum likelihood solution yields highly non-linear coupled equations which cannot be solved simultaneously. So we need an iterative technique to solve these equations. The preferred embodiment uses Expectation-Maximization (EM), first suggested by Dempster et al. (Dempster, A., Laird, N. and Rubin, D., “Maximum Likelihood from Incomplete Data Via the EM Algorithm”, J. Royal Statistical Soc., Ser. B, vol. 39, No. 1, pp. 1-38, 1977, which is incorporated herein by reference), as the optimization technique to find the parameters that maximize the likelihood. EM has been used widely for problems with incomplete data. In EM optimization, the E-step is equivalent to finding a lower bound g(θ) that touches the likelihood function at the current guess of parameters $\theta^{old}$, and the M-step is equivalent to finding the new parameter values $\theta^{new}$ that maximize this lower bound.

[0058] If $x_{i}$ denotes the data associated with the i-th pixel and there are n pixels in the image, the iterative equations used for EM are given by, $\begin{matrix}{\mu_{j}^{new} = \frac{\sum\limits_{i = 1}^{n}{x_{i}{P^{old}\left( {j/x_{i}} \right)}}}{\sum\limits_{i = 1}^{n}{P^{old}\left( {j/x_{i}} \right)}}} & \left( {{Eq}.\quad 7} \right) \\{C_{j}^{new} = \frac{\sum\limits_{i = 1}^{n}{{P^{old}\left( {j/x_{i}} \right)}\left( {x_{i} - \mu_{j}^{new}} \right)\left( {x_{i} - \mu_{j}^{new}} \right)^{T}}}{\sum\limits_{i = 1}^{n}{P^{old}\left( {j/x_{i}} \right)}}} & \left( {{Eq}.\quad 8} \right) \\{{P^{new}(j)} = {\frac{1}{n}{\sum\limits_{i = 1}^{n}{P^{old}\left( {j/x_{i}} \right)}}}} & \left( {{Eq}.\quad 9} \right)\end{matrix}$

[0059] where all the terms except P(j/x_(i)) have been defined earlier in equations (2) and (3). P(j/x_(i)) is the posterior probability, which indicates the probability of the i^(th) pixel belonging to cluster j. The posterior probability is computed using Bayes rule as,
$\begin{matrix}
P^{old}(j/x_{i}) = \frac{p^{old}(x_{i}/j)\, P^{old}(j)}{\sum_{k=1}^{K} p^{old}(x_{i}/k)\, P^{old}(k)} & (\text{Eq. 10})
\end{matrix}$

[0060] where the conditional densities p(x_(i)/j) are obtained using equation (3). The posterior probabilities given by equation (10) form the main results of the unsupervised learning. For each cluster j, the posterior probabilities are computed for each image pixel, and the map showing these probabilities is called a cluster probability map 26 (see FIG. 1). These maps for the five clusters in the original image (FIG. 2) are given in FIG. 7.
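The EM iteration of equations (7)-(10) may be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the embodiment's implementation: the function names, the fixed iteration count `n_iter`, the small diagonal regularizer `reg` and the floor on the joint density are all hypothetical additions for numerical safety, and initial values for the means, covariances and mixing parameters are assumed to be supplied by the caller (for example from the component selection stage).

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    solved = np.linalg.solve(cov, diff.T).T          # C^{-1} (x - mu)
    expo = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm

def e_step(x, mus, covs, priors):
    """Posterior P(j/x_i) from Bayes rule (Eq. 10); shape (n, K)."""
    joint = np.stack(
        [priors[j] * gaussian_pdf(x, mus[j], covs[j]) for j in range(len(priors))],
        axis=1,
    )
    joint = np.maximum(joint, 1e-300)                # assumed floor against underflow
    return joint / joint.sum(axis=1, keepdims=True)

def m_step(x, post, reg=1e-6):
    """Parameter updates of Eqs. 7-9 given the posteriors."""
    n, d = x.shape
    weight = post.sum(axis=0)                        # sum_i P(j/x_i)
    mus = (post.T @ x) / weight[:, None]             # Eq. 7
    covs = []
    for j in range(post.shape[1]):
        diff = x - mus[j]
        c = (post[:, j, None] * diff).T @ diff / weight[j]   # Eq. 8
        covs.append(c + reg * np.eye(d))             # assumed regularizer
    priors = weight / n                              # Eq. 9
    return mus, np.array(covs), priors

def fit_em(x, mus, covs, priors, n_iter=50):
    """Alternate E and M steps; return parameters and final posteriors."""
    for _ in range(n_iter):
        post = e_step(x, mus, covs, priors)
        mus, covs, priors = m_step(x, post)
    return mus, covs, priors, e_step(x, mus, covs, priors)
```

Reshaping column j of the returned posteriors to the image dimensions gives the cluster probability map for cluster j.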

[0061] In these cluster probability maps, a brighter pixel indicates a higher probability. It is clear from FIG. 7 that different clusters have captured the similarity in pixels belonging to different semantic classes. In increasing order of j, the maps broadly represent grass, red wall, sky, floor and water. However, several deficiencies can be seen in the cluster maps. In particular, the semantic classes that represent relatively small regions have been merged with other clusters, e.g., skin regions were merged with red wall or floor. There is some intuitively obvious misclustering, e.g., for j=1, parts of sky, water and swimming costume have been clustered along with the grass class. However, it should be emphasized that at this stage the algorithm has no notion of a semantic class. The results of probabilistic clustering are purely based on the ‘similarity’ of the data points. The main aim of unsupervised learning is not to obtain perfect clustering but to obtain a probabilistic estimate of cohesiveness among the pixels, which can later be used for refining the results of supervised learning.

[0062] Supervised Learning

[0063] In the present embodiment of the invention, supervised learning 16 (see FIG. 1) has been used for assigning each image pixel a probability of association with every semantic class belonging to the recognition vocabulary. In our current example, the vocabulary contains five classes, i.e., sky, water, sand/soil, skin and grass/tree. The term ‘supervised’ is derived from the fact that labeled training data is used. A generative approach is taken to obtain probabilities of association between a pixel and a class. In a generative approach, it is assumed that data belonging to each class comes from a true but unknown probability distribution. Mixture models are used to represent the density of data belonging to different classes.

[0064] Mixture Model

[0065] Mixture models belong to the semi-parametric class of models, which combine the advantages of both parametric and nonparametric models. Unlike parametric models, mixture models are not restricted to any functional form, and unlike nonparametric models, the size of a mixture model grows with the complexity of the problem being solved, not simply with the size of the data set. An important property of mixture models is that, for many choices of component density function, they can approximate any continuous density to arbitrary accuracy, provided the model has a sufficiently large number of components and the parameters of the model are chosen correctly (see Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, 1995). The mixture model in the present embodiment of the invention is a mixture of Gaussians. Mixture models can capture significant inherent variations (i.e., multimodality) in the data belonging to the same semantic class. An intuitive justification of the mixture model can be seen in FIG. 8. Suppose one is training on the data belonging to the class ‘sky’. The pixels belonging to clear blue sky look quite similar and may be represented with a Gaussian 40. However, the sky pixels belonging to white cloudy regions may require another Gaussian blob 42 to represent the similar cloudy pixels. Thus, the pixels belonging to the sky class can be generated by mixing the two blobs, yielding a mixture density 44. The red curve shows the mixture of the two Gaussians. The example uses just two Gaussians, but in practice more may be needed depending on the data. It should be noted that all the parameters of the model, including the number of Gaussians required to capture the inherent variations in a class, are learnt automatically from the training data.

[0066] According to the mixture of Gaussians model, the class conditional density is given by,
$\begin{matrix}
p(y/\omega) = \sum_{m=1}^{M} p(y/m, \omega)\, P(m/\omega) & (\text{Eq. 11})
\end{matrix}$

[0067] where y is a data point belonging to the training set of class ω and M is the number of Gaussians (or components). As per this model, each data point y belonging to class ω is generated by first choosing a Gaussian component with probability P(m/ω) and then generating the data point with probability p(y/m, ω), which is a Gaussian given by,
$\begin{matrix}
p(y/m, \omega) = \frac{1}{\left| 2 \pi C_{m} \right|^{1/2}} \exp \left\{ - \frac{1}{2} \left( y - \mu_{m} \right)^{T} C_{m}^{-1} \left( y - \mu_{m} \right) \right\} & (\text{Eq. 12})
\end{matrix}$

[0068] where μ_(m) is the mean and C_(m) is the covariance matrix for component m. There are two main issues to be addressed to learn the density given in equation (11), i.e.,

[0069] How many components are required to learn the density of a given class (component selection).

[0070] What are the density parameters: P(m/ω), mean and covariance matrix for each Gaussian (parameter estimation).

[0071] It may be noted that this is the same problem as was solved in the unsupervised learning, except that, in supervised learning, the data is not the input image but the labeled training set belonging to a particular class. Accordingly, the KL divergence based scheme is used for component selection as outlined above. All other parameters of the mixture model are learnt by using EM. Iterative equations similar to the ones given in equations (7)-(10) were used. As a brief note on component selection, choosing more components than a threshold does no significant harm, because EM generates a small mixing probability, i.e., P(m/ω), for the extra components. So it suffices if a component selection scheme can suggest such a threshold. Of course, the learning process slows down as the number of components increases.

[0072] The output of the supervised learning is a class probability map 28 (see FIG. 1) for each class in the recognition vocabulary. A class probability map shows the probability of each pixel having come from a given class, i.e., p(x/ω). A class probability map corresponding to class ‘grass/tree’ of the input image (FIG. 2) is given in FIG. 9. A brighter pixel indicates higher probability. It can be seen that the generative model 22 has correctly classified the grass/tree regions with high probability. However, there are several non-grass areas with significant probability of being grass, e.g., parts of water, wet floor, etc. Although probabilities in these regions are not very high, they have the potential of misleading the Bayesian refinement of class maps while implementing an iterative feedback scheme. Similar maps are obtained for the other classes in the recognition vocabulary.
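Evaluating a class probability map from an already learnt mixture of Gaussians, per equations (11) and (12), may be sketched as follows. The sketch assumes that the per-pixel features are stored in an array of shape (height, width, d) and that the mixing probabilities, means and covariances for one class have already been estimated; the function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_pdf(y, mu, cov):
    """Gaussian component density of Eq. 12 at each row of y."""
    diff = y - np.asarray(mu, dtype=float)
    solved = np.linalg.solve(cov, diff.T).T
    expo = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt(np.linalg.det(2 * np.pi * np.asarray(cov, dtype=float)))
    return np.exp(expo) / norm

def class_conditional_density(y, weights, mus, covs):
    """Mixture density of Eq. 11: sum over m of p(y/m, w) P(m/w)."""
    return sum(w * gaussian_pdf(y, mu, cov) for w, mu, cov in zip(weights, mus, covs))

def class_probability_map(features, weights, mus, covs):
    """Density of every pixel under one class model; shape (height, width)."""
    h, w, d = features.shape
    dens = class_conditional_density(features.reshape(-1, d), weights, mus, covs)
    return dens.reshape(h, w)
```

Calling `class_probability_map` once per class model yields one map per class in the recognition vocabulary.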

[0073] Merging Unsupervised and Supervised Learning

[0074] The main inspiration for merging the cluster probability maps 26, obtained from unsupervised learning 14, and class probability maps 28, obtained from supervised learning 16, comes from the possibility of modification and refinement of the class probability maps. As discussed before, the cluster probability maps 26 capture the similarity of various pixels in a single image, while the class probability maps 28 capture the similarity of image pixels and the training data belonging to different classes. The class probability maps 28 are generally not perfect, as the generative models are learnt from training data obtained from images captured under different physical conditions, e.g., variations in illumination. Thus, some of the pixels in a given image might look ‘close’ to the training data of a particular class even though the pixels actually do not belong to that class. This leads to false alarms in the class probability maps. It can be seen in the class probability map for class ‘grass/tree’ (FIG. 9) that parts of wet floor and water regions have been assigned a significant probability of being grass. So, the hope is that if the concept of pixel similarity can be enforced within a single given image, it may be possible to obtain better class probability maps.

[0075] The first step towards merging the cluster and class probability maps 26 and 28 is to find which cluster probability maps have a high probability of representing a given class. This is done by maximizing the class conditional probability given a cluster, i.e., P(ω_(i)/j). To compute this class posterior, the probability of a cluster given a class is first written as,

p(j/ω_(i))=∫p(j, x/ω_(i))dx

[0076] This follows from the definition of a marginal distribution. Using Bayes law on the integrand on the RHS,

p(j/ω_(i))=∫p(x/ω_(i))p(j/x, ω_(i))dx

[0077] But in the second term of the above integrand, clustering is independent of the class given the data, i.e., p(j/x, ω_(i))=p(j/x). This is the case of conditional independence. So,

p(j/ω_(i))=∫p(x/ω_(i))p(j/x)dx

[0078] Since we are dealing with discrete data and the probability mass should add to one, the cluster conditional is given by,
$\begin{matrix}
P(j/\omega_{i}) = \frac{\sum_{x} P(j/x)\, P(x/\omega_{i})}{\sum_{k=1}^{K} \sum_{x} P(k/x)\, P(x/\omega_{i})} & (\text{Eq. 13})
\end{matrix}$

[0079] Using equation (13), the class conditional can be computed easily from the Bayes rule,
$\begin{matrix}
P(\omega_{i}/j) = \frac{P(j/\omega_{i})\, P(\omega_{i})}{\sum_{k=1}^{W} P(j/\omega_{k})\, P(\omega_{k})} & (\text{Eq. 14})
\end{matrix}$

[0080] As before, the class priors P(ω_(i)) are assumed to be uniform.
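Equations (13) and (14) may be sketched in NumPy as follows. The sketch assumes that the sum over x in equation (13) runs over the pixels of the input image, that the cluster probability maps are stacked in an array of shape (K, height, width) and the class probability maps in an array of shape (W, height, width); the function names are hypothetical.

```python
import numpy as np

def cluster_given_class(cluster_maps, class_maps):
    """Eq. 13: P(j/w_i), returned with shape (K, W).

    cluster_maps[j] holds P(j/x) and class_maps[i] holds P(x/w_i)
    for every pixel x of the image."""
    k = cluster_maps.shape[0]
    w = class_maps.shape[0]
    cl = cluster_maps.reshape(k, -1)
    cs = class_maps.reshape(w, -1)
    joint = cl @ cs.T                       # sum over x of P(j/x) P(x/w_i)
    return joint / joint.sum(axis=0, keepdims=True)   # normalize over clusters

def class_given_cluster(p_j_given_w, class_priors=None):
    """Eq. 14: P(w_i/j), rows indexed by cluster j, columns by class i."""
    k, w = p_j_given_w.shape
    if class_priors is None:
        class_priors = np.full(w, 1.0 / w)  # uniform class priors
    num = p_j_given_w * class_priors[None, :]
    return num / num.sum(axis=1, keepdims=True)       # normalize over classes
```

Taking `np.argmax` along each row of the result of `class_given_cluster` gives, for every cluster, the class it most probably represents.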

[0081] The posterior probabilities of different classes given a cluster have been plotted. FIGS. 10(a)-(e) give the plots corresponding to the five cluster probability maps. It is clear from the plots that cluster j=1 has a much higher probability of being from class ‘grass/tree’ than from the other four classes. Also, this is the only cluster that has a high ‘grass/tree’ probability. So, the cluster probability map corresponding to j=1 (FIG. 7(a)) is used to refine the class probability map for class ‘grass/tree’ (FIG. 9). This correspondence between the chosen class and cluster probability maps is obvious. It should be mentioned that there might be more than one cluster that corresponds to the same class, depending on how the clustering was done. This scenario can be easily included in the modification scheme of the class probability map.

[0082] Once a cluster (or clusters) corresponding to a class has been chosen, the modified class probability map 30 is computed by weighting each pixel probability of the class probability map by the corresponding pixel probability of the cluster probability map. The modified class probability map of class ‘grass/tree’ is given in FIG. 11. The probabilities of false grass regions, e.g., parts of floor, water and skin, have been significantly reduced. In case more than one cluster corresponds to the same class, the modified maps using each cluster are merged using the mixing ratio given by the mixing parameters P(j) learnt by EM during unsupervised learning.
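The weighting and merging step may be sketched as follows. The pixel-wise product follows the description above; renormalizing the mixing parameters P(j) over only the chosen clusters is an assumption of this sketch, since the text does not spell out how the mixing ratio is applied, and the function name and argument layout are hypothetical.

```python
import numpy as np

def modified_class_map(class_map, cluster_maps, cluster_ids, mixing):
    """Weight a class probability map by its corresponding cluster map(s).

    class_map    : (height, width) class probability map
    cluster_maps : (K, height, width) cluster probability maps
    cluster_ids  : indices of the clusters chosen for this class
    mixing       : mixing parameters P(j) learnt by EM, length K
    """
    ids = list(cluster_ids)
    w = np.asarray([mixing[j] for j in ids], dtype=float)
    w = w / w.sum()                      # assumed: renormalize over chosen clusters
    out = np.zeros_like(class_map, dtype=float)
    for wj, j in zip(w, ids):
        out += wj * class_map * cluster_maps[j]   # pixel-wise weighting
    return out
```

With a single chosen cluster the result reduces to the pixel-wise product of the class probability map and that cluster probability map.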

[0083] Referring to FIG. 12, there is illustrated a computer system 110 for implementing the present invention. Although the computer system 110 is shown for the purpose of illustrating a preferred embodiment, the present invention is not limited to the computer system 110 shown, but may be used on any electronic processing system such as found in home computers, kiosks, retail or wholesale photofinishing, or any other system for the processing of digital images. The computer system 110 includes a microprocessor-based unit 112 for receiving and processing software programs and for performing other processing functions. A display 114 is electrically connected to the microprocessor-based unit 112 for displaying user-related information associated with the software, e.g., by means of a graphical user interface. A keyboard 116 is also connected to the microprocessor-based unit 112 for permitting a user to input information to the software. As an alternative to using the keyboard 116 for input, a mouse 118 may be used for moving a selector 120 on the display 114 and for selecting an item on which the selector 120 overlays, as is well known in the art.

[0084] A compact disk-read only memory (CD-ROM) 122 is connected to the microprocessor-based unit 112 for receiving software programs and for providing a means of inputting the software programs and other information to the microprocessor-based unit 112 via a compact disk 124, which typically includes a software program. In addition, a floppy disk 126 may also include a software program, and is inserted into the microprocessor-based unit 112 for inputting the software program. Still further, the microprocessor-based unit 112 may be programmed, as is well known in the art, for storing the software program internally. The microprocessor-based unit 112 may also have a network connection 127, such as a telephone line, to an external network, such as a local area network or the Internet. A printer 128 is connected to the microprocessor-based unit 112 for printing a hardcopy of the output of the computer system 110.

[0085] Images may also be displayed on the display 114 via a personal computer card (PC card) 130, such as, as it was formerly known, a PCMCIA card (based on the specifications of the Personal Computer Memory Card International Association), which contains digitized images electronically embodied in the card 130. The PC card 130 is ultimately inserted into the microprocessor-based unit 112 for permitting visual display of the image on the display 114. Images may also be input via the compact disk 124, the floppy disk 126, or the network connection 127. Any images stored in the PC card 130, the floppy disk 126 or the compact disk 124, or input through the network connection 127, may have been obtained from a variety of sources, such as a digital camera 134 or a scanner 136 (for example, by scanning an original, such as a silver halide film). The digital camera 134 may also download images to the computer system through a communications link 140 (e.g., an RF or IR link). In accordance with the invention, the algorithm may be stored in any of the storage devices heretofore mentioned and applied to images in order to process and classify the images.

[0086] The invention has been described with reference to a preferred embodiment. However, it will be appreciated that variations and modifications can be effected by a person of ordinary skill in the art without departing from the scope of the invention.

PARTS LIST

[0087] 10 input color image

[0088] 12 feature extraction stage

[0089] 14 unsupervised learning stage

[0090] 16 supervised learning stage

[0091] 18 component selection stage

[0092] 20 clustering algorithm

[0093] 22 generative model

[0094] 24 labeled training data

[0095] 26 cluster probability map

[0096] 28 class probability map

[0097] 30 modified class probability map

[0098] 40 clear blue sky Gaussian

[0099] 42 cloudy sky Gaussian

[0100] 44 mixture density

[0101] 110 computer system

[0102] 112 microprocessor-based unit

[0103] 114 display

[0104] 116 keyboard

[0105] 118 mouse

[0106] 120 selector

[0107] 122 CD-ROM

[0108] 124 compact disk

[0109] 126 floppy disk

[0110] 127 network connection

[0111] 128 printer

[0112] 130 PC card

[0113] 132 card reader

[0114] 134 digital camera

[0115] 136 scanner

[0116] 140 communications link

What is claimed is:
 1. A method for classification of image regions by probabilistic merging of a class probability map and a cluster probability map, said method comprising the steps of: a) extracting one or more features from an input image composed of image pixels; b) performing unsupervised learning based on the extracted features to obtain a cluster probability map of the image pixels; c) performing supervised learning based on the extracted features to obtain a class probability map of the image pixels; and d) combining the cluster probability map from unsupervised learning and the class probability map from supervised learning to generate a modified class probability map to determine the semantic class of the image regions.
 2. The method as claimed in claim 1 wherein the extracted features include color and textual features.
 3. The method as claimed in claim 1 wherein the unsupervised learning in step b) comprises the steps of: determining number of clusters in the input image; estimating the parameters of a probabilistic model describing the clusters; and assigning each image pixel to one of the clusters according to the probabilistic model.
 4. The method as claimed in claim 1 wherein the supervised learning of step c) comprises the steps of: creating a labeled training set belonging to a particular class; determining a number of components required to learn a density function of a given class with the labeled training set as input; estimating parameters of each density function in a mixture model; and assigning each image pixel to one of the classes according to the mixture model.
 5. The method as claimed in claim 1 wherein the unsupervised learning of step b) comprises the steps of: determining a number of clusters in the input image using a Kullback-Leibler (KL) divergence method; estimating mean and covariance parameters of a normally distributed probabilistic model describing the clusters using an Expectation-Maximization (EM) technique; and assigning each image pixel to one of the clusters according to the normally distributed probabilistic model by computing a posterior probability using Bayes rule.
 6. The method as claimed in claim 1 wherein the supervised learning of step c) comprises the steps of: creating a labeled training set belonging to a particular class; determining a number of components required to learn a density function of a given class with the labeled training set as input, using a Kullback-Leibler (KL) divergence method; estimating the mean and covariance parameters of each density function in a Gaussian mixture model using an Expectation-Maximization (EM) technique; and assigning each image pixel to one of the classes according to the Gaussian mixture model.
 7. The method as claimed in claim 1 wherein step d) comprises the steps of: maximizing a joint likelihood of class and cluster by computing a class conditional probability using Bayes rule; assigning each of the cluster probability maps to one of the classes according to the class conditional probability; and computing the modified class probability map by weighting each pixel probability of the class probability map by the corresponding pixel probability of the cluster probability map.
 8. The method as claimed in claim 1 wherein step a) comprises the step of extracting and computing low-level features selected from the group including color, texture, shapes, and wavelet coefficients from the input image.
 9. The method as claimed in claim 1 wherein step a) comprises the step of detecting and extracting semantic-level features selected from the group including faces, people, and structures from the input image.
 10. A computer program product for classification of image regions by probabilistic merging of a class probability map and a cluster probability map comprising: a computer readable storage medium having a computer program stored thereon for performing the steps of a) extracting one or more features from an input image composed of image pixels; b) performing unsupervised learning based on the extracted features to obtain a cluster probability map of the image pixels; c) performing supervised learning based on the extracted features to obtain a class probability map of the image pixels; and d) combining the cluster probability map from unsupervised learning and the class probability map from supervised learning to generate a modified class probability map to determine the semantic class of the image regions.
 11. The computer program product as claimed in claim 10 wherein the extracted features include color and textual features.
 12. The computer program product as claimed in claim 10 wherein the unsupervised learning in step b) comprises the steps of: determining number of clusters in the input image, estimating the parameters of a probabilistic model describing the clusters; and assigning each image pixel to one of the clusters according to the probabilistic model.
 13. The computer program product as claimed in claim 10 wherein the supervised learning of step c) comprises the steps of: creating a labeled training set belonging to a particular class; determining a number of components required to learn a density function of a given class with the labeled training set as input; estimating parameters of each density function in a mixture model; and assigning each image pixel to one of the classes according to the mixture model.
 14. The computer program product as claimed in claim 10 wherein the unsupervised learning of step b) comprises the steps of: determining a number of clusters in the input image using a Kullback-Leibler (KL) divergence method; estimating mean and covariance parameters of a normally distributed probabilistic model describing the clusters using an Expectation-Maximization (EM) technique; and assigning each image pixel to one of the clusters according to the normally distributed probabilistic model by computing a posterior probability using Bayes rule.
 15. The computer program product as claimed in claim 10 wherein the supervised learning of step c) comprises the steps of: creating a labeled training set belonging to a particular class; determining a number of components required to learn a density function of a given class with the labeled training set as input, using a Kullback-Leibler (KL) divergence method; estimating the mean and covariance parameters of each density function in a Gaussian mixture model using an Expectation-Maximization (EM) technique; and assigning each image pixel to one of the classes according to the Gaussian mixture model.
 16. The computer program product as claimed in claim 10 wherein step d) comprises the steps of: maximizing a joint likelihood of class and cluster by computing a class conditional probability using Bayes rule; assigning each of the cluster probability maps to one of the classes according to the class conditional probability; and computing the modified class probability map by weighting each pixel probability of the class probability map by the corresponding pixel probability of the cluster probability map.
 17. The computer program product as claimed in claim 10 wherein step a) comprises the step of extracting and computing low-level features selected from the group including color, texture, shapes, and wavelet coefficients from the input image.
 18. The computer program product as claimed in claim 10 wherein step a) comprises the step of detecting and extracting semantic-level features selected from the group including faces, people, and structures from the input image. 